ABSTRACT

The field of machine learning has developed a wide 
array of techniques for improving the effectiveness of performance 
elements. Ideally, a learning system would adapt its commitments to 
the demands of a particular learning situation, rather than relying 
on fixed commitments that impose tradeoffs between the efficiency and 
utility of a learning technique. This article presents an extension 
of the COMPOSER learning approach that dynamically adjusts its 
learning behavior based on the resources available for learning. 
COMPOSER is a speed-up learning technique that provides a statistical 
approach to the utility problem. The system identifies a sequence of 
transformations that, with high probability, increase the Type 1 
utility of an initial planning system. The approach breaks the task 
into a learning phase and a utilization phase. This extension to 
COMPOSER adopts a rational policy that dynamically balances the 
trade-off between efficiency and utility. Implications for learning 
systems are discussed. (Contains 24 references.) (SLD) 
1. INTRODUCTION

The field of machine learning has developed a wide array of techniques for improving the effective- 
ness of performance elements. Learning techniques are able to take general performance systems 
and tailor them to the eccentricities of particular domains. In this fashion, slow general systems can 
be automatically adapted into efficient problem solvers for particular domains. Unfortunately, the 
task of learning is difficult. Learning systems must operate under limited resources and must make 
many compromises in the interest of learning efficiency. These compromises appear in the form of 
design commitments implicit in the architecture of learning systems. The ramification of these
commitments is that they impose tradeoffs between the efficiency and usefulness of a learning
technique. The fixed nature of these commitments limits the generality of learning techniques. Ideally,
a learning system would adapt its commitments to the demands of a particular learning situation. 
In this article we present an extension of the COMPOSER learning approach [Gratch92b] which 
dynamically adjusts its learning behavior based on the resources available for learning. 

2. UTILITY-BASED VIEW OF LEARNING 

Viewed abstractly, a learning system tailors a performance element to be effective in some
environment. We take the view that a performance element is some procedure which accepts and
executes a series of tasks. For example, the performance element might be a classifier, in which case
each input is a feature vector and the output is a classification. Alternatively, the performance
element could be a planner, where the inputs are problem specifications and the outputs are plans.
The environment is simply the set of tasks a performance element faces. Adaptation to this environment must be
judged against some criteria for success. For example, in the case of a classifier, success is typically 
judged in terms of classification accuracy. In planning, criteria include accuracy, planning efficien- 
cy, and plan quality. 

In this paper we will advocate the utility-based view of learning adopted by several authors 
[Doyle90, Gratch92b, Greiner92b, Leckie91, Subramanian92]. A particular environment can be
characterized by a probability distribution over the set of possible tasks. The user provides a utility 
function which specifies his criteria for success on individual tasks. The effectiveness of a perform- 
ance element is characterized by its expected utility over the task distribution. This is the sum of 
the utility of each task weighted by the probability of a task's occurrence. For example, in classifica- 
tion, the standard utility function assigns one to a correctly classified feature vector, and a zero to 
an incorrectly classified vector. The expected utility is simply the accuracy of the performance
element over the distribution.
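
As a small illustration of this definition, the following sketch computes expected utility as a probability-weighted sum over tasks. The task names, probabilities, and the `expected_utility` helper are hypothetical and chosen only for illustration; with the zero-one utility function the result reduces to classification accuracy.

```python
def expected_utility(task_probs, utility):
    """Sum of utility(task) weighted by the probability of each task."""
    total_prob = sum(task_probs.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("task probabilities must sum to 1")
    return sum(p * utility(task) for task, p in task_probs.items())


# Hypothetical environment: three feature vectors and their probabilities.
task_probs = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
# Hypothetical classifier behaviour: which vectors it labels correctly.
correct = {"x1": True, "x2": True, "x3": False}


def zero_one(task):
    """Standard classification utility: 1 if correct, 0 otherwise."""
    return 1.0 if correct[task] else 0.0


accuracy = expected_utility(task_probs, zero_one)
```

Swapping `zero_one` for a different utility function (for example, negative solution time for a planner) changes the criterion for success without changing the computation.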

Learning can be viewed as a transformational process where, through experience with the
environment, some initial performance element, $PE_0$, is transformed into a performance element, $PE^*$, with
higher expected utility [Gratch92b, Greiner92b]. A classifier can be transformed by updating its 
representation of the concepts to be classified — for instance through specializing or generalizing 
transformations. A planner may be transformed by the addition of control knowledge including ma- 
cro-operators [Braverman88, Laird86, Markovitch89], control rules [Etzioni90, Minton88, Mitch- 
ell83], and static board evaluation functions [Utgoff91]. 

The transformations available to a learner define its vocabulary of transformations. These are essen- 
tially learning operators and collectively they define a transformation space. For instance, acquiring 
a macro-operator can be viewed as transforming the initial system (the original planner) into a new 
system (the planner operating with the macro-operator). In [Drummond90], the addition of a reac- 
tive rule transforms one subset of the universal plan into another. In [Minton88], a planner's search 
control strategy is transformed by the addition or deletion of a control rule. A learning technique 
must explore this space for a sequence of transformations which results in a better planner. 

3. LEARNING COST 

The utility-based view of learning facilitates a close analogy between learning and work in rational 
reasoning. In reasoning there is a reasoner which must choose, from a set of actions, an action with
high expected utility. In the utility-based view of learning, there is a learner which must choose,
from amongst a set of possibly transformed performance elements, a performance element with high
expected utility. In reasoning, a reasoner which always chooses the action with maximal expected
utility is substantively rational [Simon76] (also called Type 1 rationality [Good71]). Similarly, we can
define a substantively rational learning system as a system which always identifies the transformed 
performance element with maximum expected utility. 

Substantive rationality is seldom attainable in that it assumes infinite resources. This has led
to a focus on rationality under limited resources. Simon refers to this as procedural rationality (also
called Type 2 rationality [Good71]) because the focus is on identifying efficient procedures for
making good-enough decisions. A procedurally rational agent relaxes the strict requirements of
substantive rationality in the interest of reasoning efficiency. The analogy between reasoning and
learning applies here as well. It is seldom reasonable to expend the resources necessary for a learning
system to find optimal solutions. Instead we demand that our learning techniques identify good- 
enough solutions quickly. Learning techniques embody numerous constraints to achieve tractable 
behavior. For example, classification learning techniques embody biases to restrict the space of
potential transformations. Learning-to-plan techniques also embody numerous constraints (see
[Gratch92a]). 

The demands of learning under limited resources can be addressed by two types of deviation from
the substantive ideal. In one approach, the generality of a technique can be restricted to special cases. 
A learning technique retains substantive rationality as long as the learning problem falls within the 
restricted set of cases. For instance, classification learning techniques can exactly identify the target 
concept with polynomially many examples when it is drawn from a restricted class like monomials or
k-DNF [Pitt]. But this guarantee only applies if we know in advance that the target concept is a
member of one of these restricted classes. If the concept lies outside the class, the techniques may identify
only sub-maximal representations of concepts.



The second deviation is to abandon the goal of maximizing expected utility and instead search for
satisfactory, rather than optimal, choices. This corresponds to the notion of satisficing search [Simon75].
Thus, a learning system can trade off potential gains from learning in the interest of maintaining
efficiency. A strong learning bias can prevent a system from entertaining some of the best
transformed performance elements but, hopefully, it can efficiently identify an adequate performance
element from the reduced set.

The disadvantage of abandoning substantive rationality is that it afforded a definition of desirable 
behavior that was independent of the particular procedure which implements the rational behavior. 
The set of optimal solutions is uniquely defined by the set of choices and the utility function. We
lose this uniqueness when we move to procedural rationality. To discuss procedurally rational learn- 
ing systems we must discuss particular policies for resolving the trade-off between utility and learn- 
ing cost. 

3.1 Fixed Policy 

For non-trivial learning problems there is a clear tradeoff between the expected utility of learned 
performance elements and the efficiency of learning. The typical approach is to adopt a fixed policy 
towards resolving this tradeoff. The learning system implementor commits to some fixed set of con- 
straints for his or her learning approach. Frequently these constraints are unarticulated and appear 
implicitly through the learning system architecture. For example, SOAR [Laird86] transforms its 
planner by acquiring macro-operators or "chunks." A particular domain theory defines the space 
of possible chunks, but only a subset of possible chunks are actively considered. These are the 
chunks which arise from problem solving impasses. Furthermore, once a chunk is learned, it can
never be forgotten. Thus SOAR is employing a restricted irrecoverable search through the sets of
possible chunks. This is clearly more efficient than choosing among all possible sets of chunks, but
it is less clear how this policy impacts the potential utility of the resulting planners. 

Fixed policies can be quite effective in restricted situations. Unfortunately, the same policy may not 
apply equally well in all circumstances. We may demand different behavior from our learning
system depending on whether we have a little or a lot of resources to commit towards learning. The former
requires a highly restricted learning technique while the latter would be better served by a more liberal
policy.

3.2 Parameterized Policy 

An alternative to a fixed policy is to allow the user some control over the behavior of the learning 
technique. This can be seen as incorporating some degrees of freedom into the learning technique 
which must be resolved by the user. An example is the user specified confidence parameter provided 
by PAC-learning techniques. Higher confidence requires more examples and thus higher learning 
cost. The user is free to resolve this tradeoff based on the demands of his or her particular circum- 
stances. 
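
To make the cost of confidence concrete, the sketch below evaluates one standard textbook bound, which is not taken from this article: a learner that outputs a hypothesis consistent with the data, drawn from a finite hypothesis class of size |H|, needs at most (ln|H| + ln(1/delta)) / epsilon examples to be within error epsilon with probability at least 1 - delta. The function name and the numbers are illustrative assumptions.

```python
import math


def pac_sample_size(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite class.

    epsilon: tolerated error of the learned hypothesis.
    delta:   tolerated probability of failure (confidence is 1 - delta).
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# Hypothetical setting: 1000 hypotheses, 10% error tolerance.
at_95_percent = pac_sample_size(1000, 0.1, 0.05)
at_99_percent = pac_sample_size(1000, 0.1, 0.01)
```

Tightening the confidence from 95% to 99% raises the required number of examples, which is exactly the utility versus learning-cost tradeoff the user is asked to resolve.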

Another example, derived by analogy to work in reasoning, is to construct "anytime" learning sys- 
tems. Anytime algorithms can be interrupted at any point with a useful result [Dean88]. Further- 
more, results become monotonically better over time. An anytime learning algorithm must at all 
times maintain a representation of some viable performance element. As learning resources are ex- 
pended, the currently represented performance element must monotonically improve in utility. This 
allows the user to arbitrarily determine the resources to commit to learning. 

Parameterized policies can greatly enhance the flexibility of a learning technique, but they also place 
greater demands on the user. Also, it is not sufficient to provide degrees of freedom. If a learning
technique is to be useful, the user must be told how this freedom impacts the tradeoff between utility
and efficiency. Sometimes this information is only available after learning has begun. For example, 
given an anytime learning algorithm, the decision of when to terminate might depend on how fast 
the current algorithm is improving, or worse, on how fast it will improve if we continue learning. 
If this information is not available to the user, the additional flexibility is only a burden. Under these 
circumstances, it is reasonable to build into the learning system capabilities to estimate future learn- 
ing benefit and provide this information to the user. Thus, adding useful flexibility can greatly com- 
plicate the task of designing a learning system. 

3.3 Rational Policy

Incorporating degrees of freedom into a learning system increases the flexibility of an approach, but
it also increases the demands on the user. There may be a quite complex mapping between the goals 
of the user and the setting of the various learning parameters. Ideally, the user should be able to artic- 
ulate his or her goals and leave it to the learning system to configure the policy to best satisfy the 
goals. If the learner can estimate the cost of learning and the expected improvement which results from
it, it can use these quantities to dynamically tailor a policy which is suited to the particular learning
task. We say a learning system incorporates a rational policy if it dynamically balances the trade-off
between learning utility and learning efficiency. A rational learner is a learning system which uses 
a rational policy. 

A rational policy requires the user to explicitly specify the relationship between utility and learning 
cost. Just as the user of a substantively rational learning approach supplies a utility function (hence- 
forth called a Type 1 utility function) which indicates his or her goals, the user of a procedurally ratio- 
nal learning system must supply a utility function (henceforth called a Type 2 utility function) which 
indicates how these goals are discounted by the cost to achieve them. 

A Type 2 utility function can be surprisingly straightforward. For example, in speed-up learning
the problem is to increase the efficiency of a problem solver. Under realistic situations, there are 
limited resources which must be divided between learning and problem solving. The obvious Type 
2 utility function is the expected number of problems which can be solved within a given resource 
limit. There is some number of problems we can expect to solve with the initial problem solver. 
If the benefits of learning greatly outweigh the resource cost, it is worthwhile expending some re- 
sources towards learning a better problem solver. By maximizing the Type 2 utility function a ratio- 
nal learning system identifies the best tradeoff between learning utility and cost. 
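
A minimal numeric sketch of this tradeoff follows. All quantities (the resource limit, the per-problem costs, and the learning costs) are hypothetical, and the helper names are ours; the point is only that learning pays off when the speed-up outweighs the resources it consumes.

```python
def expected_problems(resources, cost_per_problem):
    """Problems solvable with the given resources and no learning."""
    return resources / cost_per_problem


def expected_problems_after_learning(resources, learning_cost, new_cost_per_problem):
    """Problems solvable once learning_cost has been spent on learning."""
    remaining = max(resources - learning_cost, 0.0)
    return remaining / new_cost_per_problem


# Hypothetical numbers: 1000 resource units, 10 units per problem initially.
baseline = expected_problems(1000.0, 10.0)
# Learning costs 200 units and halves the per-problem cost: worthwhile.
worthwhile = expected_problems_after_learning(1000.0, 200.0, 5.0)
# The same speed-up bought for 700 units leaves too little to exploit it.
too_costly = expected_problems_after_learning(1000.0, 700.0, 5.0)
```

Here the cheap learning episode raises the count from 100 to 160 problems, while the expensive one lowers it to 60, so a rational learner would accept the first and decline the second.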

4. A SPECIFIC RATIONAL LEARNING TASK 

We further explore the issues of rational learning by providing a rational extension of an existing 
machine learning technique. For this we choose the COMPOSER system [Gratch92b]. COMPOS- 
ER is a speed-up learning technique which provides a statistical approach to the utility problem. 
The system identifies a sequence of transformations which, with high probability, increase the Type 
1 utility of an initial planning system. The approach breaks the task into two phases, a learning phase
and a utilization phase. First there is a learning phase where examples are taken and transformations
adopted. At some point, determined by the user, learning terminates and the user is expected to
utilize the final planner. The learning phase is broken down into a series of stages where after each
stage some transformation is adopted. 

COMPOSER implements a fixed policy. The space of transformations is explored by greedy hill- 
climbing. Each new best guess is the result of applying a single transformation to the last guess. 
Which transformation to adopt is determined by drawing example problems from a fixed problem
distribution, and measuring the change in Type 1 utility afforded by a set of possible transformations.
The technique chooses the first transformation which reaches statistical significance using a particu- 
lar statistical technique. The first transformation to be identified may not be the transformation 
which provides the greatest change in Type 1 utility. Thus COMPOSER does not employ steepest 
ascent hill-climbing. On the other hand it requires fewer examples than what would be needed to 
identify the steepest ascent. This reflects a particular policy on the trade-off between utility and 
efficiency. COMPOSER stops learning when it exhausts a set of training examples which are pro- 
vided by the user. 

We have recently developed an extension to COMPOSER which does employ steepest ascent. This 
approach takes sufficient examples at each iteration to identify the transformation which generates 
a greater increase in Type 1 utility than any other transformation. This can take significantly more 
examples, and thus significantly more resources, than the original COMPOSER system. Depending 
on the user's goals and available resources, this may or may not be an advance. We could provide
a parameter which configures the system to behave somewhere between these two extremes. Unfor- 
tunately, the efficiency of the learning system depends on the particular learning task to which it is 
applied, and this efficiency is generally unknown before the system begins to learn. Thus the user 
may not be able to make an informed decision on how to set the parameter. For this reason we
propose a rational extension of COMPOSER where the degree of freedom is which transformation to
adopt at each step, among a choice of alternatives which, with high probability, improve the perform- 
ance of the current planner. This extension will also internalize the decision of when to stop learning. 

Rational learning requires a Type 2 utility function. COMPOSER is a speed-up learning technique
which improves the efficiency, but not the accuracy of a planner. We will consider a particular Type 
2 function. The learning system should try to maximize the expected number of problems which 
can be solved after learning, given a fixed set of resources. We call this Type 2 function ENP for 
Expected Number of Problems. One might consider other utility functions, but this one seems
reasonable for the class of tasks for which COMPOSER is intended, and it helps to illustrate several interesting
issues that face a rational learner.

We describe a hill-climbing approach to the problem of maximizing ENP. The learning algorithm 
proceeds by a series of stages. In each stage some number of example problems is taken and a deci- 
sion is made to terminate the learning process or to adopt a transformation. A transformation is 
adopted if two conditions are satisfied: 

i) it enhances the effectiveness of the current performance element with high probability. The
acceptable error on the ith stage is specified by $\delta_i$ (footnote 1).

ii) it produces the greatest expected single-step increase in the expected number of problems which 
can be solved after learning. 

The latter is the degree of freedom which the learning system can rationally control. The learning
process hill-climbs through a sequence of performance elements, $PE_0, PE_1, \ldots$, where each step is
expected to be the largest increase in the expected number of problems which can be solved after 
learning, but there is no guarantee of global optimality. Before each stage the learning system must 
decide if it should continue to learn. If so, it must decide which transformation best satisfies the 
above criteria. Section 4.3 describes the implementation, but first we must introduce some notation. 

1. The constant $0 < \delta_i < 1$ bounds the probability that the heuristic added on the ith step will fail to improve the ith
performance element. This can be set such that the total error across all stages is less than some pre-specified constant. See
[Greiner92a] for one strategy.



4.1 Sequential Analysis 



The problem of identifying beneficial transformations is treated as a problem of statistical inference.
The learning system entertains a set of possible transformations. Example problems are drawn ran- 
domly according to the fixed problem distribution and statistics are extracted from each example. 
Many statistical inference procedures are based on a fixed sample size — the number of examples 
necessary to make a conclusion is determined in advance of any observations. The heart of our tech- 
nique utilizes a sequential statistical procedure [Govindarajulu81]. Sequential procedures differ 
from fixed-sized techniques in that the sample size is a function of the observations. Sequential
procedures provide a test called a stopping rule which determines when sufficient examples have been
taken. Examples are taken until the stopping rule is satisfied. The number of examples taken when
the stopping rule is satisfied is called the stopping time. An important advantage of sequential
procedures is that the average number of examples required to perform inference is typically smaller than
the number required by a fixed-sized technique. This is because a sequential procedure is able to
take advantage of the information in the observations to determine the sample size. 

First we review standard statistical notation. Let $X$ be a random variable. An observation of a
random variable can yield one of a set of possible numeric outcomes where the likelihood of each
outcome is determined by an associated probability distribution. $X_i$ is the ith observation of $X$. $EX$
denotes the expected value of $X$, also called the mean, $\mu_X$, of the distribution. $\bar{X}_n$ is the sample mean
and refers to the average of $n$ observations of $X$. More precisely, $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. $\bar{X}_n$ is a good estimator
for $EX$ (footnote 2).

A measure of the dispersion or spread of a distribution is called the variance of a distribution.
Variance, denoted $\sigma^2$, is defined as the expected squared difference between a single observed outcome
for $X$ and the mean of the distribution. Formally, $\sigma^2 = E[(X - \mu_X)^2]$. The sample variance,
$S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$, is a good estimator for the variance of a distribution.

The function $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\{-0.5 y^2\}\,dy$ is the cumulative distribution function of the standard
normal (also called standard gaussian) distribution. $\Phi(x)$ is the probability that a point drawn randomly
from a standard normal distribution will be less than or equal to $x$. This function plays an important
role in statistical estimation and inference. The Central Limit Theorem shows that, whatever the
distribution of $X$, the distribution of $\sqrt{n}(\bar{X}_n - \mu_X)/\sigma$ (footnote 3) approximates that of the standard normal
variable (see [Hogg78 pp. 192-195]). This approximation is quite good in practice, even with small
$n$. Thus, even when the distribution of $X$ is unknown, in practice we can perform accurate statistical
inference using a "normal approximation."
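
The quantities reviewed above can be computed directly. The sketch below is illustrative only: the helper names are ours, the sample variance is assumed to use the 1/n form, and the standard normal distribution function is evaluated through the identity Phi(x) = (1 + erf(x / sqrt(2))) / 2 using Python's `math.erf`.

```python
import math


def std_normal_cdf(x):
    """Phi(x): probability that a standard normal draw is <= x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_mean(xs):
    """Average of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Mean squared deviation from the sample mean (1/n form, an assumption)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def standardized_mean(xs, mu, sigma):
    """sqrt(n) * (sample mean - mu) / sigma, approximately standard normal."""
    return math.sqrt(len(xs)) * (sample_mean(xs) - mu) / sigma


observations = [1.0, 2.0, 3.0, 4.0]
mean = sample_mean(observations)
variance = sample_variance(observations)
```

Under the normal approximation, `std_normal_cdf(standardized_mean(xs, mu, sigma))` estimates how likely a sample mean at least this small would be if the true mean were `mu`.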

One of the requirements in our problem specification is that each transformation improve, with high 
probability, the expected utility of the performance element. To satisfy this requirement we rely on 
a sequential stopping rule. This rule, introduced by Nádas [Nadas69], determines if the mean of a

2. $\bar{X}_n$ is unbiased, meaning the expected value of $\bar{X}_n$ equals the mean of the distribution. Of all other possible
unbiased estimators of the mean, $\bar{X}_n$ has the least variance.

3. If $\sigma^2$ is unknown, we can use $S_n^2$ instead. This is better approximated with the t-distribution which converges
to standard normal as $n$ grows.






distribution is positive or negative where the error of this inference is at most $\delta$. The rule defines
a stopping time, $ST$, as:

$$ST = \min\left\{ n \ge n_0 : \frac{S_n^2}{\bar{X}_n^2} < \frac{n}{a^2} \right\}$$

where $a$ is defined by $\Phi(a) = \delta/2$, and $n_0$ is some predefined positive integer. After stopping, the
inference is made that the mean is greater than (less than) zero if the sample mean is greater than
(less than) zero. Thus, the sequential procedure first takes a small fixed-sized sample of $n_0$ examples
(typically 3 or 7), and then continues taking examples until the stopping rule is satisfied.
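
The procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the rule stops at the first n >= n0 for which (sample variance) / (sample mean)^2 < n / a^2, with the variance in its 1/n form, and it obtains a from Phi(a) = delta/2 using `statistics.NormalDist().inv_cdf` from the Python standard library (Python 3.8 or later). The function name and return convention are ours.

```python
from statistics import NormalDist


def sequential_sign_test(observations, delta, n0=3):
    """Consume observations until the assumed stopping rule fires.

    Returns (stopping_time, mean_is_positive), or None if the
    observations run out before the rule is satisfied.
    """
    a = NormalDist().inv_cdf(delta / 2.0)  # Phi(a) = delta / 2, so a < 0
    total = 0.0
    total_sq = 0.0
    for n, x in enumerate(observations, start=1):
        total += x
        total_sq += x * x
        if n < n0:
            continue  # always take the initial fixed-size sample
        mean = total / n
        # 1/n variance; clamp tiny negative values caused by rounding.
        variance = max(total_sq / n - mean * mean, 0.0)
        if mean != 0.0 and variance / (mean * mean) < n / (a * a):
            return n, mean > 0.0
    return None


# Low-noise data with a clearly positive mean: stops right after n0 examples.
quick = sequential_sign_test([1.0, 1.2, 0.8, 1.1, 0.9], delta=0.05)
# Noisy data with a mean near zero: four examples are not enough.
undecided = sequential_sign_test([5.0, -4.0, 5.0, -4.0], delta=0.05)
```

The contrast between the two calls shows why the sample size is a function of the observations: the clearer the sign of the mean relative to the noise, the sooner the rule is satisfied.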

In our approach we evaluate multiple transformations simultaneously. In particular we assign a
stopping rule to each of $n$ transformations and let them "race." The winner of the race is the transformation
with the smallest stopping time. We then base an inference on the results of the winning stopping
rule. If the error for each stopping rule is $\delta$, the error of an $n$-way race is higher. In the worst case
the error is $n\delta$.
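
The worst-case figure follows from the union bound. In the short derivation below, $E_k$ is our notation (not the original's) for the event that the kth stopping rule reaches a wrong conclusion, each such event having probability at most $\delta$:

```latex
\Pr\Big[\,\bigcup_{k=1}^{n} E_k\Big]
  \;\le\; \sum_{k=1}^{n} \Pr[E_k]
  \;\le\; n\,\delta
```

One common remedy, offered here as an assumption rather than as part of the procedure described in the text, is to run each rule at error level $\delta/n$ so that the race as a whole errs with probability at most $\delta$.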

In our rational extension we must estimate the cost of learning. Given that we are using the Nádas
stopping rule, the cost of learning will be a function of the stopping times associated with different
transformations (this will be stated more precisely in Section *X*). Thus, one element of an estimate
for learning cost is an estimate for stopping times. We can develop an estimator for $ST$ using a sample
of $m$ examples, where $m < ST$, by using $S_m^2$ and $\bar{X}_m$ as estimators for $S_n^2$ and $\bar{X}_n$, and solving the
inequality within the stopping rule for $n$. Thus $n = a^2 S_m^2 / \bar{X}_m^2$. $n$ must obey the further constraint that it is an
integer greater than or equal to $n_0$. So an estimator for $ST$ is:

$$\widehat{ST}_m = \max\left\{ n_0, \left\lceil \frac{a^2 S_m^2}{\bar{X}_m^2} \right\rceil \right\}$$
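
A minimal sketch of this estimator follows. It assumes the estimate is the larger of n0 and the ceiling of a^2 times (sample variance) / (sample mean)^2, computed from the m examples seen so far, with the 1/n form of the variance and a defined by Phi(a) = delta/2. The function name is hypothetical, and returning infinity for a zero sample mean is our own convention.

```python
import math
from statistics import NormalDist


def estimate_stopping_time(observations, delta, n0=3):
    """Predict the stopping time from a partial sample of m observations."""
    m = len(observations)
    if m < 2:
        raise ValueError("need at least two observations")
    a = NormalDist().inv_cdf(delta / 2.0)  # Phi(a) = delta / 2
    mean = sum(observations) / m
    variance = sum((x - mean) ** 2 for x in observations) / m
    if mean == 0.0:
        return math.inf  # the rule can never fire on a zero sample mean
    return max(n0, math.ceil(a * a * variance / (mean * mean)))


# Hypothetical incremental-utility observations for one transformation.
noisy_estimate = estimate_stopping_time([1.0, -1.0, 1.0, 1.0], delta=0.05)
clean_estimate = estimate_stopping_time([2.0, 4.0], delta=0.05)
```

With delta = 0.05 the noisy sample predicts roughly a dozen examples, while the clean sample predicts that the initial n0 examples already suffice; a learner can use such predictions to budget learning cost before the rule actually stops.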



4.2 Implementation Specific Definitions 

The algorithm proceeds through a series of stages. Between each stage the algorithm decides if 
learning should continue. If the decision is to continue, the algorithm must identify a transformation 
which enhances the number of problems which can be solved after it is acquired. We use the index
variable $i$, $i = 0, 1, \ldots$, to indicate a specific stage. $PE_i$ denotes the performance element which exists
at the start of the ith stage. The user supplies an initial performance element $PE_0$.

4.2.1 Transformations

At each stage the learning system has some set of transformations it can potentially apply to the cur- 
rent performance element. In the general case, transformations can be added and removed from this 
set at any point in the learning process. Let $T_{i,j}$ denote the set of transformations available at the
start of the jth problem within stage $i$, and let $\vec{T}_i$ be a vector which describes how the set of
transformations changes within stage $i$. Typically, $\vec{T}_i$ depends on the example problems, and thus may not
be knowable until after learning. APPLY is a function which transforms a performance element with
a particular transformation. If a transformation, $t$, is adopted on the ith stage, the learning system
creates a new performance element by applying the transformation: $PE_{i+1} = \text{APPLY}(t, PE_i)$.
4.2.2 Resources

Each learning stage consumes resources. $R_i$ denotes the resources remaining at the start of the ith
stage. The user supplies an initial resource limit $R_0$. After each stage the resources available for
the subsequent stage are reduced by whatever resources were used.

Let $r_j(PE_i)$ denote a random variable which corresponds to the resources required to solve the jth
problem using $PE_i$. We use the abbreviation $r_j$ where the stage number is unambiguous. $\bar{r}_n(PE_i)$ is
the average resource use of $PE_i$ over $n$ problems. This is a good estimator for the mean resource use
of $PE_i$ ($E[r(PE_i)]$).

Let $\Delta r_j(t_k|PE_i)$ denote a random variable which corresponds to the incremental utility of
transformation $t_k$ on problem $j$ over $PE_i$. This is the change in resource use that would result on problem $j$ if
$t_k$ were applied to $PE_i$. We use the abbreviation $\Delta r_j(t_k)$ where the stage number is unambiguous. See
[Gratch92b] for one description of how to obtain such values. $\overline{\Delta r}_n(t)$ is the average change in
resource use provided by transformation $t$ over $n$ problems. This is a good estimator for the
incremental utility of the transformation ($E[\Delta r(t|PE_i)]$). We can estimate the expected resource use of the
performance element APPLY($t$, $PE_i$) by adding the average resource use of $PE_i$ and the change in
resource use of transformation $t$ given $PE_i$. Formally, $E[r(\text{APPLY}(t, PE_i))] = \bar{r}_n(PE_i) + \overline{\Delta r}_n(t)$.
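
As a small worked instance of this estimate, the sketch below averages per-problem resource use and per-problem changes in resource use, then adds the two. The data and function names are hypothetical; a negative change means the transformation reduces resource use on that problem.

```python
def average(values):
    """Sample mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def estimate_transformed_resource_use(base_costs, changes):
    """Estimate the mean resource use after applying a transformation.

    base_costs: resources used by the current performance element
                on each example problem.
    changes:    change in resource use the transformation would have
                produced on the same problems (negative = savings).
    """
    if len(base_costs) != len(changes):
        raise ValueError("need one change per example problem")
    return average(base_costs) + average(changes)


# Hypothetical measurements over three example problems.
base_costs = [10.0, 12.0, 8.0]
changes = [-2.0, -3.0, -1.0]
estimated_cost = estimate_transformed_resource_use(base_costs, changes)
```

The current element averages 10 units per problem and the transformation saves 2 on average, so the transformed element is estimated to need 8 units per problem.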

4.2.3 Learning Cost

A transformation provides some increment of benefit to the expected utility of a performance
element. To realize this benefit we must allocate some of the available resources towards learning.
Under our statistical formalization of the problem, learning cost is a function of the number of
example problems required to learn a transformation, and the cost of processing each example problem.
The number of examples depends on our criteria for adoption. For the current task, transformations
must improve expected utility with high probability. Using the Nádas stopping rule, the number of
examples required is simply the stopping time associated with the transformation, $ST(t)$.

In the general case, the cost of processing the jth problem depends on several factors. It can depend
on the particulars of the problem. It can also depend on the currently transformed performance
element, $PE_i$. For example, many learning approaches derive utility statistics by executing (or
simulating the execution of) the performance element on each problem. Finally, as potential
transformations must be reasoned about, learning cost can depend on the current set of transformations, $T_{i,j}$.

Let $\lambda_j(T_{i,j}, PE_i)$ denote the learning cost associated with the jth problem under the transformation
set $T_{i,j}$ and the performance element $PE_i$. The total learning cost associated with a transformation,
$t$, is the sum of the per-problem learning costs over the number of examples needed to apply the
transformation. Let $\lambda(t, \vec{T}_i, PE_i)$ denote the learning cost for transformation $t$, which is defined as

$$\lambda(t, \vec{T}_i, PE_i) = \sum_{j=1}^{ST(t)} \lambda_j(T_{i,j}, PE_i)$$

where $ST(t)$ is the stopping time associated with transformation $t$.

4.2.4 When and What to Learn

Under our formalization, the task of learning is to maximize the expected number of problems (ENP)
which can be solved after learning. Denote this by $ENP(R, PE)$. This number is a function of the
resources which remain after learning and the transformed performance element. Unfortunately we
do not know these parameters until learning is complete. We are adopting a hill-climbing approach
which simplifies the problem somewhat. At the start of each stage, the learning system only has to
decide if there exists a single transformation which improves the ENP. Thus the learning system
must estimate the learning cost and benefit of the transformations available on a given stage.

Let $ENP_i(t)$ be the expected number of problems which would result if transformation $t$ were adopted
on stage $i$. Estimating this value is the key to deciding if learning should be performed and, if so,
which transformation should be applied. We now consider how this can be estimated. If we are in
stage $i$, by definition, $R_{i+1}$ is the resources available after stage $i$. $PE_{i+1}$ is the performance element
which results from this stage and the expected resource use of $PE_{i+1}$ is $E[r(PE_{i+1})]$. Recall that this
is the mean resource cost to solve a problem. The expected number of problems which can be solved
with $PE_{i+1}$ given $R_{i+1}$ resources is simply the ratio of the available resources and the per-problem
resource use:

$$ENP(R_{i+1}, PE_{i+1}) = \frac{R_{i+1}}{E[r(PE_{i+1})]}$$

$PE_{i+1}$ is the result of transforming $PE_i$ with some transformation $t^*$. Similarly $R_{i+1}$ is defined as $R_i$
minus the resource cost to learn $t^*$. The transformation $t^*$ should be the member of $T_i$ which
yields the largest ENP. Let us consider the expected number of problems associated with a particular
transformation.

Let $ENP_i(t)$ be the expected number of problems which could be solved if, on stage $i$, $t$ is adopted
and learning is immediately terminated. If $t$ were adopted, $PE_{i+1} = \text{APPLY}(t, PE_i)$ and the expected
resource use of $PE_{i+1}$ is $E[r(\text{APPLY}(t, PE_i))] = E[r(PE_i)] + E[\Delta r(t)]$. The resources which remain on
stage $i+1$ are the resources beginning stage $i$ minus the cost to learn $t$, or $R_{i+1} = R_i - \lambda(t, \vec{T}_i, PE_i)$. Thus,
$ENP_i(t)$ is defined as the ratio of $R_{i+1}$ to $E[r(PE_{i+1})]$, or:

$$ENP_i(t) = \frac{R_i - \lambda(t, \vec{T}_i, PE_i)}{E[r(PE_i)] + E[\Delta r(t|PE_i)]}$$

The learning system should pick the transformation which maximizes ENP. Thus, $t^*$ is the $t \in T_i$ 
such that $ENP_i(t^*) = \max_{t \in T_i} ENP_i(t)$. Thus, to implement a solution to this rational learning task we must 
derive an estimator for $ENP_i(t)$. 
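
As a minimal sketch of the selection rule above, the following Python fragment computes $ENP(R, PE)$ and $ENP_i(t)$ from known expectations and picks the maximizing transformation. The `Candidate` record, the function names, and the numbers in the usage lines are illustrative assumptions, not part of the original system; clamping the remaining resources at zero is likewise an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one transformation t (names are illustrative)."""
    name: str
    expected_benefit: float  # E[delta r(t | PE_i)]: mean per-problem saving
    learning_cost: float     # lambda(t, T_i, PE_i): resources spent learning t


def enp(resources: float, expected_cost: float) -> float:
    """ENP(R, PE) = R / E[r(PE)]."""
    return resources / expected_cost


def enp_after(resources: float, expected_cost: float, cand: Candidate) -> float:
    """ENP_i(t) = (R_i - lambda) / (E[r(PE_i)] - E[delta r(t | PE_i)])."""
    new_cost = expected_cost - cand.expected_benefit
    if new_cost <= 0:
        raise ValueError("benefit must be smaller than the current mean cost")
    # Assumption of this sketch: resources cannot go below zero.
    remaining = max(0.0, resources - cand.learning_cost)
    return remaining / new_cost


def best_transformation(resources, expected_cost, candidates):
    """Return (t*, ENP_i(t*)) for the candidate with the largest ENP_i(t)."""
    scored = ((c, enp_after(resources, expected_cost, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    cands = [Candidate("A", 2.0, 100.0), Candidate("B", 5.0, 600.0)]
    print(enp(1000.0, 10.0))  # baseline ENP without learning
    best, value = best_transformation(1000.0, 10.0, cands)
    print(best.name, value)
```

In this made-up example the cheaper transformation wins even though its per-problem benefit is smaller, because the costlier one consumes most of the resources that would otherwise be spent solving problems.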

4.2.5 Implementation Specific Assumptions 

Our rational learning approach to this specific learning problem depends on an ability to estimate 
$ENP_i(t)$. This in turn requires estimators for several parameters which depend on the unknown 
problem distribution. This includes the resource use of each performance element, $E[r(PE)]$, and the 
benefit, $E[\Delta r(t|PE)]$, and learning cost, $\lambda(t, T, PE)$, of each potential transformation. This places 
certain demands on what information must be extracted from each example problem. To estimate 
resource use and the benefit for each transformation, we require the learning system to determine the 
resource cost for each problem, $r_j(PE_i)$, and the change in this cost which each transformation could 
provide, $\Delta r_j(t|PE_i)$. The expected resource cost and transformation benefit can be straightforwardly 
estimated by the sample mean of each of these observations, $\bar{r}(PE_i)$ and $\overline{\Delta r}(t|PE_i)$. 

Estimating the learning cost is complicated by the parameter $T$, which may not be knowable until 
after learning is complete. For our first approach to this problem we make a number of simplifying 
assumptions. We assume that the learning system is provided with a fixed set of transformations. 
Within a stage, transformations can be discarded from this set, but never added. $T_i$ indicates the 
set which is available at the beginning of each stage. The user or learning system designer supplies 
an initial set $T_0$. After each stage $i$ where a transformation is adopted, $T_{i+1}$ is set to $T_0$ minus any 
transformations from $T_0$ which have already been applied (we assume there is no benefit in applying 
the same transformation multiple times). This assumption simplifies some of the statistics by 
ensuring that each transformation has been evaluated over a same-sized set of example problems. 

We further simplify the problem of estimating learning cost by assuming that the cost to evaluate 
a problem is independent of which transformations are being evaluated. Let $\lambda_j(PE_i)$ be the cost of 
processing the $j$th example problem and $\lambda(t, PE_i) = \sum_j \lambda_j(PE_i)$, summed over the example problems 
processed while learning $t$, be the cost to learn transformation $t$. This 
assumption simplifies the problem of estimating learning cost. Let $\bar{\lambda}(PE_i)$ denote the average 
learning cost across $n$ example problems. Then for any transformation $t$, if $ST(t)$ is $t$'s stopping time, 
$\bar{\lambda}(PE_i) \times ST(t)$ is a good estimator for the cost to learn $t$. Thus, we can use the following estimator 
for $ENP_i(t)$: 

$$\widehat{ENP}_i(t) = \frac{R_i - ST(t)\,\bar{\lambda}(PE_i)}{\bar{r}(PE_i) - \overline{\Delta r}(t|PE_i)}$$

We will discuss the consequences of relaxing these assumptions in Section 4.4. 
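
A plug-in estimator of $ENP_i(t)$ under these assumptions might look like the sketch below: sample means stand in for the expectations, and the mean per-problem learning cost times the stopping time stands in for the cost to learn $t$. The function and argument names are hypothetical, and the stopping time is taken as an input here rather than computed.

```python
from statistics import mean


def estimate_enp(resources, learn_costs, solve_costs, deltas, stopping_time):
    """Plug-in estimate of ENP_i(t) from observations on example problems.

    learn_costs   -- per-problem learning costs (lambda_j)
    solve_costs   -- per-problem resource costs of PE_i (r_j)
    deltas        -- per-problem savings if t were applied (delta r_j(t))
    stopping_time -- number of examples assumed necessary to learn t
    """
    lam_bar = mean(learn_costs)
    r_bar = mean(solve_costs)
    dr_bar = mean(deltas)
    new_cost = r_bar - dr_bar
    if new_cost <= 0:
        raise ValueError("estimated saving exceeds the estimated mean cost")
    return (resources - stopping_time * lam_bar) / new_cost


if __name__ == "__main__":
    value = estimate_enp(
        resources=1000.0,
        learn_costs=[2.0, 2.0, 2.0, 2.0],
        solve_costs=[10.0, 12.0, 8.0, 10.0],
        deltas=[3.0, 1.0, 2.0, 2.0],
        stopping_time=50,
    )
    print(value)
```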

4.3 Implementation 

Learning proceeds through a series of stages. Between each stage the system must decide if it is 
worthwhile to learn for one more stage. If not, control is transferred to the performance element 
which expends the remaining resources on problem solving. If learning is expected to increase the 
ENP, the learning system must decide what to learn next. As we will see, these two questions are 
related: deciding whether to learn depends on what is learned. 

4.3.1 When to Learn 

The system should continue learning for another stage if there exists some transformation of the 
current performance element which will improve the ENP. Let $R_i$ and $PE_i$ be the current available 
resources and performance element (initially these are $R_0$ and $PE_0$). If learning is terminated at this 
decision point, the current performance element, $PE_i$, can solve some expected number of problems 
with the remaining resources. In particular, we can expect $PE_i$ to solve $R_i / E[r(PE_i)]$ problems. If the 
learning system can identify a transformation $t \in T_i$ with $ENP_i(t)$ greater than this number, learning 
should proceed for at least one more stage. We will estimate the answer to this question using a 
fixed-sized statistical inference procedure. The learning system will process a small number of 
examples and then infer if learning is worthwhile. 

For a given stage $i$, let $PE_i$ randomly select and process $n_0$ (a predetermined integer) problems. $n_0$ 
should be chosen relatively small. Let the learning cost for those problems be $\lambda_1, \lambda_2, \ldots, \lambda_{n_0}$. Let the 
resource cost for $PE_i$ over the problems be $r_1, r_2, \ldots, r_{n_0}$. Let $\Delta r_1(t_k), \Delta r_2(t_k), \ldots, \Delta r_{n_0}(t_k)$ be the change 
in resource use over each problem if the transformation $t_k$ is added to $PE_i$, $k = 1, 2, \ldots, |T_i|$. 

To determine if learning is worthwhile we will estimate $ENP_i(t)$ for each transformation, based on 
the $n_0$ example problems. If we adopt a transformation, it must benefit the performance element with 
probability $1 - \delta_i$. Our estimator for $ENP_i(t)$ requires estimators of four values: the mean learning 
cost, the mean problem solving cost, the mean benefit for the transformation, and the stopping time 
for the transformation. We base these values on the following statistics, respectively: $\bar{\lambda}_{n_0}$, $\bar{r}_{n_0}$, 
$\overline{\Delta r}_{n_0}(t)$, and 

$$ST_1(t) = \max\left\{ n_0, \frac{S^2_{n_0}(t)\,a^2}{[\overline{\Delta r}_{n_0}(t)]^2} \right\}$$

where $a$ satisfies $\Phi(a) = \delta_i / (4|T_i|)$. This selection for $a$ is explained in Section 4.3.2. 

Learning will help if there exists some $t \in T_i$ which yields an improved ENP. It suffices to consider 
the transformation with the maximum estimated $ENP_i(t)$. Denote this by 

$$ENP_1 = \max_{t \in T_i} \frac{R_i - ST_1(t)\,\bar{\lambda}_{n_0}}{\bar{r}_{n_0} - \overline{\Delta r}_{n_0}(t)}$$

Recall that this statistic was developed in Section 4.2.5. $R_i / \bar{r}_{n_0}$ is an estimator for ENP if 
learning is terminated without adopting a transformation. If this value is larger than $ENP_1$, 
learning should be terminated. 

Terminate learning if: $ENP_1 \leq R_i / \bar{r}_{n_0}$ (T1) 

If this inequality is true, then terminate the process of learning and commence solving problems with 
$PE_i$. Intuitively, this means we stop learning if the ENP without learning is higher than the ENP 
after learning for one more stage. If the test fails (there is some transformation which yields a higher 
ENP) we learn for at least one more stage. 
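
The fixed-size test can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\Phi(a)$ is read as the upper-tail probability of the standard normal distribution, the stopping-time estimate is kept real-valued, and transformations whose estimated saving is zero or exceeds the mean cost are simply skipped. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def normal_threshold(delta: float, n_transforms: int) -> float:
    """a such that the upper normal tail beyond a is delta / (4 |T_i|)."""
    return NormalDist().inv_cdf(1.0 - delta / (4.0 * n_transforms))


def stopping_time_estimate(deltas, a: float, n0: int) -> float:
    """ST_1(t) = max(n0, S^2 a^2 / mean^2), using the sample variance S^2."""
    dr_bar = mean(deltas)
    if dr_bar == 0:
        return math.inf
    return max(n0, variance(deltas) * a * a / (dr_bar * dr_bar))


def should_continue(resources, learn_costs, solve_costs, deltas_by_t, delta):
    """True if some transformation's estimated ENP beats stopping now."""
    n0 = len(solve_costs)
    a = normal_threshold(delta, len(deltas_by_t))
    lam_bar, r_bar = mean(learn_costs), mean(solve_costs)
    enp_now = resources / r_bar              # ENP if learning stops here
    best = -math.inf                         # running value of ENP_1
    for deltas in deltas_by_t.values():
        st = stopping_time_estimate(deltas, a, n0)
        new_cost = r_bar - mean(deltas)
        if math.isinf(st) or new_cost <= 0:  # skipped: assumption of the sketch
            continue
        best = max(best, (resources - st * lam_bar) / new_cost)
    return best > enp_now


if __name__ == "__main__":
    obs = {"t1": [4.0, 5.0, 6.0, 5.0, 5.0]}
    print(should_continue(10000.0, [1.0] * 5, [10.0] * 5, obs, delta=0.1))
    print(should_continue(8.0, [1.0] * 5, [10.0] * 5, obs, delta=0.1))
```

With ample resources the estimated saving justifies another stage; with very few resources the cost of the examples needed to confirm the saving outweighs it.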

4.3.2 What to Learn 

If the learning system decides that another stage of learning is sanctioned, the system must choose 
some transformation to adopt. The definition of our learning task imposed two requirements. First, 
each applied transformation must, with high probability, improve the expected utility of the 
performance element. Secondly, it must choose a transformation which yields high ENP. We break 
the decision of what to learn into two steps. First the algorithm attempts to identify a single 
transformation which improves utility with high probability. If none are discovered, learning is terminated. 
If such a transformation is discovered, the learning system then decides if it is worthwhile to take 
an additional set of examples with which to find a transformation with higher $ENP_i(t)$. The error 
for each of these decisions is set at $\delta_i/2$ so that the total error for the stage is at most $\delta_i$. 

4.3.2.1 Phase 1 

This phase processes problems, searching for transformations which have positive or negative 
incremental utility to some pre-specified error level ($\delta_i/2$). Each time a transformation demonstrates 
negative incremental utility it is removed from $T_i$. Problems are solved until $T_i$ is exhausted (whereupon 
learning terminates), or until some transformation demonstrates positive incremental utility. We use 
the sequential procedure proposed by Nádas [Nadas69]. 

Take $a$ satisfying $\Phi(a) = \delta_i / (4|T_i|)$, where $\delta_i$ is a predetermined constant $0 < \delta_i < 1$ indicating the 
acceptable error level for accepting a beneficial transformation in stage $i$. A particular transformation $t$ 
has demonstrated its incremental utility to the specified confidence when $ST(t)$ examples are taken. 
$ST(t)$ is the stopping time for transformation $t$ and it is determined by the Nádas stopping rule: 

$$ST(t) = \min\left\{ n \geq n_0 : \frac{S_n^2(t)}{[\overline{\Delta r}_n(t)]^2} < \frac{n}{a^2} \right\}$$


Let the planner randomly select and solve problems until, for at least one transformation, this 
stopping condition is satisfied. If for at least one such transformation $t$ the average $\overline{\Delta r}(t)$ at this point 
is positive, set $t_0$ to be the transformation for which the stopping condition is satisfied and for which 
$\overline{\Delta r}(t)$ is maximum, and proceed to Phase 2. Otherwise, delete all the transformations causing the 
stopping (these have $\overline{\Delta r}(t) < 0$). Keep solving problems until either Phase 2 is reached or all the 
transformations in $T_i$ are deleted. The latter ends the whole process. 

From the results in Nádas' paper, for a fixed transformation $t$, the decision of claiming $E[\Delta r(t)] > 0$ 
($E[\Delta r(t)] < 0$) if the average when the process stops is positive (negative) has an error probability 
(approximately) less than or equal to $\delta_i / (2|T_i|)$. By Bonferroni's method, the error probability at the end 
of Phase 1 of claiming $E[\Delta r(t)] > 0$ while it is negative, or deleting a transformation with positive 
$E[\Delta r(t)]$, is (approximately) less than or equal to $\delta_i/2$. 
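
Phase 1 can be sketched as the loop below. It assumes the stopping rule is read as "stop at the first $n \geq n_0$ for which the ratio of sample variance to squared sample mean falls below $n/a^2$"; the `sample_deltas` callback (which draws one problem and reports $\Delta r_j(t)$ for every surviving transformation) and the `max_problems` guard are hypothetical additions for illustration.

```python
from statistics import mean, variance


def nadas_stopped(deltas, a: float, n0: int) -> bool:
    """Stopping condition for one transformation after len(deltas) examples."""
    n = len(deltas)
    if n < n0 or n < 2:
        return False
    dr_bar = mean(deltas)
    if dr_bar == 0:
        return False
    return variance(deltas) / (dr_bar * dr_bar) < n / (a * a)


def phase1(sample_deltas, transformations, a, n0, max_problems):
    """Return (t0, observations); t0 is None if no transformation qualifies."""
    observations = {t: [] for t in transformations}
    for _ in range(max_problems):
        if not observations:          # T_i exhausted: learning terminates
            break
        draw = sample_deltas()        # one problem: {t: delta r_j(t)}
        for t in observations:
            observations[t].append(draw[t])
        stopped = [t for t, obs in observations.items()
                   if nadas_stopped(obs, a, n0)]
        winners = [t for t in stopped if mean(observations[t]) > 0]
        if winners:                   # positive incremental utility shown
            t0 = max(winners, key=lambda t: mean(observations[t]))
            return t0, observations
        for t in stopped:             # negative incremental utility: discard
            del observations[t]
    return None, observations


if __name__ == "__main__":
    from itertools import cycle
    stream = cycle([{"good": 2.0, "bad": -2.0}, {"good": 3.0, "bad": -3.0}])
    t0, obs = phase1(lambda: next(stream), ["good", "bad"], 2.0, 3, 100)
    print(t0, len(obs["good"]))
```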

4.3.2.2 Phase 2 

Following Phase 1, $t_0$ is a transformation with positive incremental utility with small error 
probability ($\delta_i/2$). Other members of $T_i$ might yield a higher ENP but they have yet to demonstrate 
significance. The purpose of Phase 2 is to decide between adopting $t_0$ immediately, or to solve an additional 
set of problems which will allow these other potentially better transformations to reach significance. 
The test of Dudewicz and Dalal [Dudewicz75] tells us the number of additional problems which 
have to be solved to determine if another transformation has greater incremental utility than $t_0$ with error 
probability $\delta_i/2$. This number can be used to determine the ENP for these other transformations. If 
no other transformation has greater ENP than $t_0$ we adopt $t_0$ immediately and proceed to the next 
stage. Otherwise we determine some subset of $T_i$ which is worth investigating further. 

The Dudewicz and Dalal procedure is designed to choose, from among a population of $K$ random 
variables, the random variable with the highest mean. The procedure identifies the correct variable 
(the one with the highest mean) with probability $p^*$ whenever the difference between the top two 
means is greater than or equal to some value $\varepsilon$. In the case where the difference is less than $\varepsilon$, the 
procedure may not select the best mean, but in this case we, with high probability, select a mean 
which is "$\varepsilon$-close." 

The procedure is based on a multi-variate t-distribution. This plays a role analogous to the 
standard normal distribution in the Nádas technique. Instead of using the constant $a$, Dudewicz and 
Dalal define the constant $h$ as follows: let $h = h_m(K, p^*)$ be the unique solution of 

$$\int_{-\infty}^{\infty} [F_m(x + h)]^{K-1} f_m(x)\,dx = p^*$$

where $p^*$ is the probability of a correct decision, $K$ is the number of random variables, and $F_m(\cdot)$ and $f_m(\cdot)$ 
are the cumulative distribution function and density respectively of a Student-t random variable with 
$m = ST - 1$ degrees of freedom: 

$$f_m(x) = \frac{\Gamma((m+1)/2)}{\sqrt{m\pi}\,\Gamma(m/2)} \left(1 + \frac{x^2}{m}\right)^{-(m+1)/2}, \quad F_m(x) = \int_{-\infty}^{x} f_m(w)\,dw, \quad \Gamma(m) = \int_0^{\infty} y^{m-1} e^{-y}\,dy.$$

In our problem, $K = |T_i|$ and $p^* = 1 - \delta_i/2$, where $\delta_i/2$ is the acceptable error for this phase. The table of 
$h_m(K, p^*)$ is given in [Dudewicz75]. 
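
Where the published table is not at hand, $h_m(K, p^*)$ can be approximated numerically from its defining integral. The sketch below is one rough way to do so with only the Python standard library: it tabulates the Student-t density on a truncated grid, integrates by simple quadrature, and bisects on $h$. The grid limits, step size, and function names are assumptions of this illustration, and the truncation makes it unsuitable for very small $m$, where the tails are heavy.

```python
import math


def t_pdf(x: float, m: int) -> float:
    """Density f_m of a Student-t variable with m degrees of freedom."""
    log_c = (math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
             - 0.5 * math.log(m * math.pi))
    return math.exp(log_c - (m + 1) / 2 * math.log1p(x * x / m))


def build_table(m: int, lo: float = -60.0, hi: float = 60.0, step: float = 0.01):
    """Tabulate f_m and a trapezoid-rule F_m on a uniform grid."""
    n = int(round((hi - lo) / step))
    xs = [lo + k * step for k in range(n + 1)]
    pdf = [t_pdf(x, m) for x in xs]
    cdf = [0.0]
    for k in range(1, n + 1):
        cdf.append(cdf[-1] + 0.5 * (pdf[k - 1] + pdf[k]) * step)
    total = cdf[-1]
    return xs, pdf, [c / total for c in cdf], lo, hi, step


def t_cdf(x: float, table) -> float:
    """F_m(x) by linear interpolation in the tabulated CDF."""
    _, _, cdf, lo, hi, step = table
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    pos = (x - lo) / step
    k = min(int(pos), len(cdf) - 2)
    return cdf[k] + (pos - k) * (cdf[k + 1] - cdf[k])


def prob_correct(h: float, K: int, table) -> float:
    """Approximate the integral of [F_m(x + h)]^(K-1) f_m(x) dx."""
    xs, pdf, _, _, _, step = table
    return step * sum(t_cdf(x + h, table) ** (K - 1) * p
                      for x, p in zip(xs, pdf))


def solve_h(K: int, m: int, p_star: float, tol: float = 1e-6) -> float:
    """Bisection for h; the integral is nondecreasing in h."""
    table = build_table(m)
    lo, hi = 0.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_correct(mid, K, table) < p_star:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A quick sanity check on the quadrature is that `prob_correct(0.0, 2, table)` should be close to 0.5 by symmetry, since with $h = 0$ and $K = 2$ either of two identically distributed variables is equally likely to be the larger.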

In an analogous fashion to the estimate for the stopping time for the Nádas technique, Dudewicz and 
Dalal define the number of examples to pick the highest mean with a stopping time. We use this 
to determine the additional number of examples required for each transformation. Call the number 
of examples $ST_2(t)$: 

$$ST_2(t) = \max\left\{ ST + 1, \left\lceil \left( \frac{S(t)\,h}{\varepsilon} \right)^2 \right\rceil \right\}$$

where $\varepsilon$ is a predetermined number. This is the number of problems which are required to test if 
$t \neq t_0$ has an incremental utility $\varepsilon$ greater than $t_0$. First we will test if any transformation has higher ENP than $t_0$. This is again based 
on the statistics for $ENP_i(t)$ that we developed in Section 4.2.5, but instead of using the expected 
stopping time from the Nádas rule, we use the number of examples derived from the Dudewicz and 
Dalal test: 



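$$ENP_2 = \max_{t \neq t_0} \frac{R_i - ST_2(t)\,\bar{\lambda}}{\bar{r} - \overline{\Delta r}(t)}$$

$ENP_2$ is the maximum estimate of $ENP_i(t)$ of transformations other than $t_0$. 

If $ENP_2$ is no greater than the estimated ENP of adopting $t_0$ immediately, the system places $t_0$ into 
the strategy set of the planner, which finishes this stage. In this case we do not expect to do better than $t_0$. 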
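
The Phase 2 sample size and the $ENP_2$ statistic can be sketched as below. The constant `h` would normally come from the Dudewicz and Dalal table; the value used in the usage lines is an arbitrary placeholder, and the function names, the use of the sample standard deviation for $S(t)$, and the skipping of transformations whose estimated saving exceeds the mean cost are assumptions of this illustration.

```python
import math
from statistics import mean, stdev


def st2(deltas, h: float, eps: float) -> int:
    """ST_2(t) = max(ST + 1, ceil((S(t) h / eps)^2)), with ST = len(deltas)."""
    st = len(deltas)
    return max(st + 1, math.ceil((stdev(deltas) * h / eps) ** 2))


def enp2(resources, lam_bar, r_bar, observations, t0, h, eps):
    """Return (t1, ENP_2): the best estimated ENP among t other than t0."""
    best_t, best = None, -math.inf
    for t, deltas in observations.items():
        if t == t0:
            continue
        new_cost = r_bar - mean(deltas)
        if new_cost <= 0:             # skipped: assumption of the sketch
            continue
        value = (resources - st2(deltas, h, eps) * lam_bar) / new_cost
        if value > best:
            best_t, best = t, value
    return best_t, best


if __name__ == "__main__":
    obs = {"t0": [4.0, 4.0, 5.0, 3.0], "t1": [1.0, 3.0, 1.0, 3.0]}
    print(st2(obs["t1"], h=2.0, eps=0.5))   # h = 2.0 is a placeholder value
    print(enp2(1000.0, 1.0, 10.0, obs, "t0", h=2.0, eps=0.5))
```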

If there is at least one transformation with sufficiently high $ENP_2$, we want to find some subset of 
$T_i$ which is worthwhile evaluating. The Dudewicz and Dalal technique defines a number of example 
problems we must take to certify that a transformation has higher incremental utility than $t_0$, namely 
$ST_2(t)$. If we choose to evaluate some transformation $t$ we are forced to take at least $ST_2(t)$ examples. 
If we evaluate a set of transformations then we are forced to take a number of examples equal to the 
maximum $ST_2(t)$ of that set. Thus, while we may produce a high ENP by evaluating the entirety of 
$T_i$, it may be worthwhile to consider some subset of the available transformations. 

Let $t_1$ be the transformation with maximum $ENP_2$ (the $t$ attaining $\max ENP_2(t)$). The transformation $t_1$ will 
produce the highest expected gain in ENP. 

Let $ST_2 = \max \ldots$ 

This is the number of examples required to decide if $t_1$ is truly better than $t_0$. $t_1$ may not be better 
than $t_0$ and other members of $T_i$ may. Therefore we want to continue evaluating any potentially 
beneficial transformations which can demonstrate significant improvement within $ST_2$ problems. 

Let $T_1 = \{t \in T_i : ST_2(t) \leq ST_2\} \cup \{t_0\}$ (T3) 

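$T_1$ contains $t_0$ plus all transformations for which we can determine a significantly higher incremental 
utility over $t_0$ given $ST_2$ problems. Let $ST_3 = ST_2 - ST$. This is the additional number of problems 
which must be processed. Let the planner select and process another $ST_3$ problems. The system then 
adopts the transformation from $T_1$ with the largest $\widetilde{\Delta r}(t)$. Here, $\widetilde{\Delta r}(t)$ is a weighted average of the 
observations $\Delta r_j(t)$, $1 \leq j \leq ST_2$, $t \in T_1$. $\widetilde{\Delta r}(t)$ is given by 

$$\widetilde{\Delta r}(t) = \sum_{j=1}^{ST_2} a_j(t)\,\Delta r_j(t)$$

the $a_j(t)$ being subject to the conditions 

$$\sum_{j=1}^{ST_2} a_j(t) = 1 \quad \text{and} \quad S^2(t) \sum_{j=1}^{ST_2} a_j(t)^2 = \left( \frac{\varepsilon}{h} \right)^2$$
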

Dudewicz and Dalal suggest the following strategy for setting $a_j(t)$: 

a,(r)« ... - any, 



(^3-D5T3 J 

It was proved in Dudewicz, E. J. and Dalal, S. R. (1975) that if the difference between the largest 
$E[\Delta r(t)]$ and the second largest $E[\Delta r(t)]$ in $T_1$ is bigger than or equal to $\varepsilon$, and assuming also that the $\Delta r_j(t)$ are 
all independent, then the probability of selecting the transformation with the largest $E[\Delta r(t)]$ is no less 
than $p^*$. This is a reasonable test even if the $\Delta r_j(t)$ are not independent. Thus, as $t_0$ has positive 
incremental utility ($E[\Delta r(t_0)] > 0$) with error probability (approximately) at most $\delta_i/2$ and the transformation we 
adopt has greater incremental utility (if different than $t_0$) with error probability (approximately) 
at most $\delta_i/2$, we adopt a transformation with positive incremental utility with error probability 
(approximately) at most $\delta_i$. If the several top means are very close to each other (the difference is less 
than $\varepsilon$), we cannot assure that the transformation we choose has higher incremental utility than $t_0$, 
but intuitively, the procedure is unlikely to select a transformation with $E[\Delta r(t)]$ far away from the 
best. In this case, we do not lose much even if we do not get the best transformation. 
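
The subset selection of test (T3) and the count of additional problems can be sketched as follows. The threshold $ST_2$ is passed in as a parameter (`st2_limit`) rather than computed, and the function name and the example numbers are hypothetical.

```python
def phase2_subset(st2_by_t, t0, st2_limit, st_current):
    """Return (T_1, ST_3) for Phase 2.

    st2_by_t   -- {t: ST_2(t)} for the transformations other than t0
    st2_limit  -- the threshold ST_2 on the number of examples
    st_current -- ST, the number of examples already processed
    """
    subset = {t for t, n in st2_by_t.items() if n <= st2_limit} | {t0}
    extra = max(0, st2_limit - st_current)   # ST_3 = ST_2 - ST
    return subset, extra


if __name__ == "__main__":
    needed = {"t1": 22, "t2": 40, "t3": 15}
    subset, extra = phase2_subset(needed, "t0", st2_limit=22, st_current=10)
    print(sorted(subset), extra)
```

Here the transformation needing 40 examples is left out because it could not reach significance within the threshold, while the candidate found in Phase 1 is always retained.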

After adopting this transformation, the learning system re-initializes $T_{i+1}$ to any non-adopted 
transformations and decides again if learning should proceed. 

In summary, learning proceeds through a series of stages. Within each stage the system must make 
two rational decisions. First the system takes a small set of examples and determines if further 
learning would improve Type 2 utility. If so, some number of examples are taken to find a 
transformation meeting the minimal requirements. Next, the system decides if this transformation should 
be adopted immediately or if an additional phase of learning is likely to improve Type 2 utility. The stage 
terminates after adopting a transformation, or when it is realized that no transformation is likely 
to improve Type 2 utility. The learning system iterates through a series of stages until a decision 
is made to terminate learning. 
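
The overall control flow summarized above can be sketched as a simple loop. The two callbacks are hypothetical stand-ins: `worth_learning` plays the role of the when-to-learn test, and `run_stage` plays the role of Phases 1 and 2, returning the adopted transformation (or `None`), the resources the stage consumed, and the resulting performance element.

```python
def learn(resources, performance_element, candidates, worth_learning, run_stage):
    """Iterate stages until learning is no longer expected to raise the ENP."""
    adopted = []
    remaining = list(candidates)                 # T_0
    while remaining and worth_learning(resources, performance_element, remaining):
        chosen, cost, performance_element = run_stage(
            resources, performance_element, remaining)
        resources -= cost
        if chosen is None:                       # every candidate was discarded
            break
        adopted.append(chosen)
        # T_{i+1}: the initial set minus transformations already applied.
        remaining = [t for t in candidates if t not in adopted]
    return resources, performance_element, adopted


if __name__ == "__main__":
    def worth(res, pe, ts):
        return res > 50                          # toy stand-in for the test

    def stage(res, pe, ts):
        return ts[0], 30, pe + [ts[0]]           # toy stand-in for a stage

    print(learn(100, [], ["a", "b", "c"], worth, stage))
```

Whatever resources remain when the loop exits are what the performance element spends on problem solving.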

4.4 Discussion 

This procedure proceeds through one or more stages, producing a performance element modified 
with zero or more transformations. This continues until either the system decides it is not 
worthwhile to learn, or all transformations are discarded in Phase 1. On each stage for which a 
transformation is adopted, the new performance element will be more effective with error $\delta_i$. On average the 
new performance element will also produce a larger ENP, but we cannot yet characterize a bound 
on this probability. We are investigating numerical simulations of the technique and expect results 
to be available soon. 

We adopted a number of assumptions in this presentation. We assumed that the learning cost within 
a stage is independent of the number of transformations. This may not be realistic, as in speed-up 
learning systems like COMPOSER the cost of extracting incremental utility values depends on the 
number of transformations. The model can be extended by breaking learning time into multiple 
components: time spent solving each training problem, time spent processing each training 
problem, and an additional transformation specific cost which is the additional time required to evaluate 
each transformation. The transformation specific costs complicate the decision of whether to 
proceed with learning, for, although a set of transformations might yield a low maximum ENP, some subset 
of those transformations would have a lower per problem learning cost, and might yield a high ENP. 

The other major assumption was that the set of transformations does not grow within a stage. This 
is also not realistic, as most speed-up learning systems consider new transformations throughout 
the learning phase. The simplest approach would be to add an initial phase where transformations 
were learned. A difficulty would be in applying rationality to deciding if it were worth learning a 
new transformation. This is because we have no information on the expected improvement of a 
transformation until we learn it. We could avoid this complication by moving to a Bayesian model. 
The learning system would then need to incorporate prior expectations on the benefit of learning a 
transformation. 

5. CONCLUSION 

Learning systems cannot produce maximal increases in performance and be maximally efficient. 
Instead they must adopt a policy which balances these two needs. Most learning techniques adopt a 
particular policy to this tradeoff. Unfortunately, fixed policies limit the generality of learning 
techniques. In this article we have tentatively explored the issue of rational learning policies and we 
described an extension to the COMPOSER system which adopts a rational policy. While this is only 
a first attempt at the problem, it raises a number of interesting issues, and points to possible 
solutions for many of them. More importantly, it highlights an issue which is not sufficiently discussed 
in the learning community: the trade-off between learning efficiency and utility. 